Abstract

Several scenarios of interacting neural networks which are trained either in 
an identical or in a competitive way are solved analytically. In the case of 
identical training each perceptron receives the output of its neighbour. The 
symmetry of the stationary state as well as the sensitivity to the used training 
algorithm are investigated. Two competitive perceptrons trained on mutually 
exclusive learning aims and a perceptron which is trained on the opposite 
of its own output are examined analytically. An ensemble of competitive 
perceptrons serves as the decision-making algorithm in a model of a closed
market (El Farol Bar problem or Minority Game); each network is trained on
the history of minority decisions. This ensemble of perceptrons relaxes to a
stationary state whose performance can be better than random.

Simple models of neural networks describe a wide variety of phenomena in neurobiology
and information theory. Neural networks are systems of elements interacting by adaptive 
couplings which are trained by a set of examples. After training they function as content 
addressable associative memory, as classifiers or as prediction algorithms. Using methods of 
statistical physics many of these phenomena have been elucidated analytically for infinitely
large neural networks [1,2].

Most studies of feed-forward neural networks have concentrated on a single network learn- 
ing a fixed rule, which is usually a second network, the so-called teacher. The teacher network 
is presenting examples, sets of input/output data, and the student network is adapting its 
weights to this set of examples. In an on-line training scenario each example is presented 
only once, hence training is a dynamical process fH^]]. The teacher network may also gen- 
erate a time series of output numbers |jU§, and the student learns by following the time 
series. The weights of the teacher network are fixed in this scenario. 

Many phenomena in biology, social and computer science may be modeled by a system 
of interacting adaptive algorithms (see e.g. 0). However, little is known about general 
properties of such systems. In this paper we derive an analytic solution of a system of 
interacting neural networks. Each network is a simple perceptron with an N-dimensional
weight vector. These networks receive an identical input vector, produce output bits and
learn from each other. In Section I, each network is trained by the output of its neighbour,
with a cyclic flow of information. By iterating the training step for randomly chosen input
vectors, the dynamical process relaxes to a stationary state. In the limit $N \to \infty$ we
describe the process by ordinary differential equations for a few order parameters, similar
to the usual student /teacher scenario |3]||. We identify the symmetries of the stationary 
state and find phase transitions when increasing the learning rate of the training steps. 

In Sections II and III we study different training scenarios with two interacting perceptrons
and various learning algorithms.



In Section IV we apply the system of interacting networks to a problem of game theory
called the minority game, which is derived from the El Farol Bar problem [9,10]. We consider
a set of agents who have to make a binary decision. Each agent wins only if he/she belongs
to the minority of all decisions. This process is iterated. Each agent has to develop an
algorithm which makes a decision according to the history of the global minority decisions.
The problem recently received a lot of attention in the context of statistical physics [11].
Here we follow a novel approach: Each agent uses a perceptron for making his/her decision,
and each perceptron is trained on the minority of all output bits.

I. MUTUAL LEARNING, SYMMETRIC CASE 

In this section we investigate a system of interacting neural networks as follows: several 
identical networks are arranged on an oriented ring. All networks receive an identical input 
and produce different output according to their weight vectors. Each network is trained by 
the output of its neighbour on the ring. This process is iterated until a stationary state is 
reached in which the norms and angles between the weight vectors no longer change. We 
are interested in the properties of this stationary state. 

We consider the simplest feed-forward networks, an ensemble of K simple perceptrons,
which are represented by N-dimensional weight vectors $\mathbf{w}_i$ ($i = 1, \dots, K$) and which map a
common input vector $\mathbf{x}$ onto binary outputs $\sigma_i = \mathrm{sign}(\mathbf{x}\cdot\mathbf{w}_i)$. As order parameters we use
the norms $w_i = |\mathbf{w}_i|$ and the respective overlaps $R_{ij} = \mathbf{w}_i\cdot\mathbf{w}_j$ or $\cos(\theta_{ij}) = \mathbf{w}_i\cdot\mathbf{w}_j/(w_i w_j)$.
When only two perceptrons are considered, the subscript is dropped: $\cos(\theta) = \mathbf{w}_1\cdot\mathbf{w}_2/(w_1 w_2)$.
The components of the input vector (or pattern) are Gaussian with mean 0 and variance 1,
yielding $\mathbf{x}\cdot\mathbf{x} = O(N)$.

The updates are of the form

$$\mathbf{w}_i^+ = \mathbf{w}_i + (\eta_i/N)\, f(\sigma_i, s)\, s\, \mathbf{x} \tag{1}$$

for unnormalized weights or

$$\mathbf{w}_i^+ = \frac{\mathbf{w}_i + (\eta_i/N)\, f(\sigma_i, s)\, s\, \mathbf{x}}{|\mathbf{w}_i + (\eta_i/N)\, f(\sigma_i, s)\, s\, \mathbf{x}|} \tag{2}$$

for normalized $\mathbf{w}_i$. The + denotes a quantity after one learning step, $\eta_i$ is the learning
rate, s is the desired output, and $f(\sigma_i, s)$, the so-called weight function, defines the learning
algorithm. We mostly use $f = 1$ (the Hebbian rule, called H from now on) and $f = \Theta(-\sigma_i s)$
(the perceptron learning rule, abbreviated P [1]), and the respective variations where the
$\mathbf{w}_i$ are kept normalized, denoted as HN and PN respectively.
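
As an illustration of Eqs. (1) and (2), the following is a minimal Python sketch of a single learning step. The function names (`sign`, `dot`, `update`) and the convention sign(0) = +1 are assumptions made for this example only; they are not taken from the paper.

```python
import math

def sign(z):
    """Return +1 for z >= 0 and -1 otherwise (the choice at z = 0 is arbitrary)."""
    return 1 if z >= 0 else -1

def dot(a, b):
    """Scalar product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def update(w, x, s, eta, rule="H", normalize=False):
    """One learning step for weight vector w on pattern x with desired output s.

    rule "H": weight function f = 1 (every pattern is learned).
    rule "P": f = Theta(-sigma * s), i.e. learn only if the own output
              sigma = sign(x.w) disagrees with the desired output s.
    normalize=True rescales the result to norm 1 (rules HN and PN).
    """
    n = len(w)
    sigma = sign(dot(x, w))
    if rule == "H":
        f = 1.0
    else:
        f = 1.0 if sigma * s < 0 else 0.0
    new_w = [wi + (eta / n) * f * s * xi for wi, xi in zip(w, x)]
    if normalize:
        norm = math.sqrt(dot(new_w, new_w))
        new_w = [wi / norm for wi in new_w]
    return new_w
```

With `w = [1.0, 0.0]` and `x = [1.0, 1.0]` the own output is +1, so rule P leaves `w` unchanged for `s = +1` and moves it for `s = -1`, while rule H moves it in both cases.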

We derive differential equations for the order parameters in the thermodynamic limit
$N \to \infty$ by taking the scalar product of the update rules and introducing a time variable
$\alpha = p/N$, where p is the number of patterns shown so far. We use the analytic tools which
were previously developed for the teacher/student scenario ]||§. If the order parameters 
are self-averaging (see [12] for criteria of self-averaging in this context), integrating over the
distribution of patterns gives deterministic differential equations for the order parameters
as $N \to \infty$. The required averages are listed in the appendix.

A. Perceptron learning rule 

We first restrict ourselves to two perceptrons that try to come to an agreement by learning 
the output of the respective other perceptron. 

For rule P with identical learning rates $\eta_1 = \eta_2 = \eta$, the update rule is

$$\mathbf{w}_1^+ = \mathbf{w}_1 + \frac{\eta}{N}\,\mathbf{x}\,\sigma_2\,\Theta(-\sigma_1\sigma_2); \qquad
\mathbf{w}_2^+ = \mathbf{w}_2 + \frac{\eta}{N}\,\mathbf{x}\,\sigma_1\,\Theta(-\sigma_1\sigma_2). \tag{3}$$

The sum of both vectors is conserved under this rule: if a learning step takes place, it has the
same direction and absolute value, but different signs for the two vectors. This conservation
can be used to link $w_1$ and $w_2$ to $\cos(\theta)$: assuming that $w_1 = w_2 = w$ and starting from
$\theta = \pi/2$ with initial norm $w_0$, simple geometry gives $w_0/\sqrt{2} = \cos(\theta/2)\,w$. The conservation
is also visible in the differential equations that can be derived using the described formalism:

$$\frac{dw_1}{d\alpha} = -\frac{\eta}{\sqrt{2\pi}}\,(1 - \cos(\theta)) + \frac{\eta^2\,\theta}{2 w_1 \pi}; \tag{4}$$

$$\frac{dw_2}{d\alpha} = -\frac{\eta}{\sqrt{2\pi}}\,(1 - \cos(\theta)) + \frac{\eta^2\,\theta}{2 w_2 \pi}; \tag{5}$$

$$\frac{dR}{d\alpha} = \frac{\eta}{\sqrt{2\pi}}\,(1 - \cos(\theta))\,(w_1 + w_2) - \eta^2\,\frac{\theta}{\pi}. \tag{6}$$

If the right-hand sides of (4) and (5) vanish, so does that of (6). There is a curve of fixed
points of the system given by the equation

$$\frac{\eta}{w}\,\frac{\theta}{\sqrt{2\pi}\,(1 - \cos(\theta))} = 1. \tag{7}$$

Using the relation $w = w_0/(\sqrt{2}\cos(\theta/2))$, this can be solved numerically to give the fixed
point of $\cos(\theta)$ as a function of the scaled learning rate $\eta/w_0$, as shown in Fig. 1. For small
learning rates, the perceptrons come to good agreement, while large $\eta$ leads to antiparallel
vectors.

Geometrically, this can be understood as follows: each learning step has a component
parallel to the plane spanned by $\mathbf{w}_1$ and $\mathbf{w}_2$, which decreases the distance between the
vectors, and a perpendicular component, which increases the distance (see Fig. 2). Equilibrium
is reached when a typical learning step no longer changes the angle, i.e. the vectors stay on
a cone around $\mathbf{w}_1 + \mathbf{w}_2$. The radius of this cone increases with growing $\eta$.
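
The conservation of the sum of both weight vectors under rule P can be checked in a small simulation. The sketch below is only an illustration under assumed parameters (N = 50, 2000 random Gaussian patterns, $\eta = 1$, a fixed random seed); the function name `simulate_mutual_P` is hypothetical.

```python
import random

def dot(a, b):
    """Scalar product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def simulate_mutual_P(n=50, steps=2000, eta=1.0, seed=0):
    """Two perceptrons learn each other's output with rule P.

    Returns the largest change of any component of w1 + w2 and the number of
    learning steps actually taken (patterns on which the outputs disagreed).
    """
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    total0 = [a + b for a, b in zip(w1, w2)]
    taken = 0
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s1 = 1 if dot(x, w1) >= 0 else -1
        s2 = 1 if dot(x, w2) >= 0 else -1
        if s1 != s2:
            # Each network moves towards the output of the other one;
            # since s1 = -s2 the two steps cancel in the sum.
            w1 = [a + (eta / n) * s2 * xi for a, xi in zip(w1, x)]
            w2 = [b + (eta / n) * s1 * xi for b, xi in zip(w2, x)]
            taken += 1
    drift = max(abs(a + b - t) for a, b, t in zip(w1, w2, total0))
    return drift, taken

drift, taken = simulate_mutual_P()
```

Up to floating-point rounding, `drift` stays at zero although many learning steps are taken.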

B. Perceptron learning with normalized weights 

A similar calculation can be done for the perceptron learning rule with normalized weights
(PN), where the length $w_i$ of the weight vectors is set to 1 after each step. The perceptrons
move on a hypersphere of radius 1; in equilibrium, the average learning step leads back onto
that sphere before the vectors are normalized again.

We derive the following differential equation for $R = \cos(\theta)$:

$$\frac{dR}{d\alpha} = (R + 1)\left(\sqrt{\frac{2}{\pi}}\,\eta\,(1 - R) - \eta^2\,\frac{\theta}{\pi}\right). \tag{8}$$

Fixed points are $R = 1$, $R = -1$ and

$$\frac{\eta\,\theta}{\sqrt{2\pi}\,(1 - \cos(\theta))} = 1. \tag{9}$$

It is not a coincidence that this is equivalent to (7) if w is set to 1. The fixed point of (8)
at $R = 1$ is repulsive; the one at $R = -1$ is unstable for $\eta < 4/\sqrt{2\pi} = 1.60$. A solution of
(9) can only be found for $\eta < \eta_c = 1.816$, which corresponds to $\cos(\theta) = -0.689$.

Simulations show that the system relaxes to the fixed point given by Eq. (9) for $\eta < \eta_c$
and jumps to $R = -1$ for larger $\eta$ (see Fig. 3). This behaviour shows the characteristics of
a first-order phase transition.
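
The critical values quoted above can be reproduced numerically. The sketch below assumes the fixed-point condition in the form $\eta\,\theta = \sqrt{2\pi}\,(1 - \cos\theta)$ and finds the largest learning rate for which it has a solution by bisecting on the derivative; the function names are hypothetical.

```python
import math

def eta_fixed(theta):
    """Learning rate for which the angle theta solves eta*theta = sqrt(2*pi)*(1 - cos(theta))."""
    return math.sqrt(2 * math.pi) * (1 - math.cos(theta)) / theta

def critical_point(lo=2.0, hi=3.0, iters=60):
    """Maximise eta_fixed over theta.

    d(eta_fixed)/d(theta) vanishes where g(theta) = theta*sin(theta) - (1 - cos(theta)) = 0;
    g is positive at theta = 2 and negative at theta = 3, so bisection applies.
    """
    def g(t):
        return t * math.sin(t) - (1 - math.cos(t))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    theta_c = 0.5 * (lo + hi)
    return eta_fixed(theta_c), math.cos(theta_c), theta_c

eta_c, cos_c, theta_c = critical_point()
```

The value `eta_fixed(math.pi)` gives the learning rate at which the branch reaches $\theta = \pi$, which coincides with the stability limit $4/\sqrt{2\pi}$ of the antiparallel fixed point.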



Hence for small learning rates the two perceptrons relax to a state of nearly complete
agreement, $\theta \approx 0$. Increasing $\eta$ leads to a nonzero angle between the two vectors up to
$\theta \approx 133°$. At this rate the system jumps to complete disagreement, $\theta = 180°$.



C. Mutual learning on a ring 



The mutual learning scenario can be generalized to K perceptrons: perceptron i learns
from perceptron i + 1 if they disagree, with cyclic boundary conditions. Under rule P, the
total sum of vectors is conserved again: as many perceptrons take a step in one direction as
in the opposite.

Performing the necessary averages for the equations of motion would involve Gaussian
integrals over $K - 1$ correlated variables with $\Theta$-functions; it is not clear to us whether
this can be done analytically in general cases. However, we find in simulations that the
fixed point for rule P is completely symmetric: there is only one angle $\theta$ between all pairs
of perceptrons. Assuming that relation (7) still holds, and using the conservation of $\sum_i \mathbf{w}_i$,
one can derive (for initially orthogonal vectors of norm $w_0$)

$$\frac{\eta}{w_0}\,\frac{\theta\,\sqrt{1 + (K - 1)\cos(\theta)}}{\sqrt{2\pi}\,(1 - \cos(\theta))} = 1. \tag{10}$$

The largest angle that the perceptrons can take is $\cos(\theta) = -1/(K - 1)$, corresponding to a
K-cornered hypertetrahedron. This happens when $|\sum_i \mathbf{w}_i|$ is negligible w.r.t. $w_i$. Simulations
confirm that (10) holds, as can be seen in Fig. [I]

Similar to the case of two networks, all perceptrons agree with each other for small
learning rate $\eta \to 0$. For larger rates the system relaxes to a state of high symmetry
where all mutual angles between the K weight vectors are identical, $\theta_{ij} = \theta$. Note that the
symmetry is higher than the topology of the flow of information (the ring). For high rates
$\eta \to \infty$ the system relaxes to a state of maximal disagreement, i.e. the largest possible
mutual angle $\theta$ that is still compatible with a symmetric arrangement.

For rule PN, the sum of the weights is not preserved. The fixed point of the dynamics
follows the curve for two normalized weights described by Eq. (9) in a completely symmetric
configuration. When the hypertetrahedron angle is reached and $\sum_i \mathbf{w}_i$ vanishes, the
symmetry is partly broken. There are now different angles to nearest neighbours, next-nearest
neighbours etc., so the angles split up into $(K - 1)/2$ different branches for odd K and
$K/2 - 1$ for even K. Note that the system still has the symmetry of the ring.

With odd K, increasing $\eta$ increases the angle between nearest neighbours, up to some
limit value. This angle is not the maximum nearest-neighbour angle allowed for by the
geometric constraints, but seems to decrease with increasing K.

In the case of even K, simulations show a second transition at some higher value of $\eta$,
where the vectors split into two antiparallel clusters, thus maximizing the nearest-neighbour
angle. The learning rate at which this transition typically appears during the run of the
program increases with N. The conclusion is that the antiparallel fixed point is not stable
in the $N \to \infty$ limit, but de facto stable in simulations because the self-averaging property
of the ODEs breaks down at this point.

One may ask which symmetries survive if the perceptrons are allowed to have different
individual learning rates. A close look reveals that for rule P, there is a more general
conserved quantity: $\sum_i \mathbf{w}_i/\eta_i$. Simulations show that the angles $\theta_{ij}$ again relax to a completely
symmetric configuration depending on the average $\eta$ and the initial value of the new
conserved quantity, while the norms $w_i$ are proportional to the respective learning rates $\eta_i$. For
rule PN, variations in the learning rates not only lead to slightly different curves for each of
the angles with individually different $\eta_c$, they also suppress the transition to the antiparallel
state that is observed for even K.

D. Hebbian learning 

The reason why P and PN lead to antiparallel orientation of the weight vectors for larger 
learning rates is that they concentrate on cases where the networks disagree. Algorithms 
that reinforce what both networks agree on are more successful, as can be seen for rule H 
for two perceptrons. 
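
The differential equations are

$$\frac{dw_i}{d\alpha} = \sqrt{\frac{2}{\pi}}\,\eta\cos(\theta) + \frac{\eta^2}{2 w_i}; \qquad
\frac{dR}{d\alpha} = \sqrt{\frac{2}{\pi}}\,\eta\,(w_1 + w_2) + \eta^2\left(1 - \frac{2\theta}{\pi}\right). \tag{11}$$

This system has no common fixed point, which means that the $w_i$ grow without bounds. The
asymptotic behaviour can be seen from the equation for $\cos(\theta)$. Assuming that $w_1 = w_2 = w$,
we find

$$\frac{d\cos(\theta)}{d\alpha} = \frac{\eta}{w}\,\frac{4}{\sqrt{2\pi}}\,(1 - \cos(\theta)^2) + \frac{\eta^2}{w^2}\left(1 - \cos(\theta) - \frac{2\theta}{\pi}\right). \tag{12}$$

By taking $w \sim \sqrt{2/\pi}\,\eta\,\alpha$, the ODE leads to $1 - \cos(\theta) \propto \alpha^{-4}$ for $\alpha \to \infty$. This means that
$\theta \propto \alpha^{-2}$.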

Simulations agree with the numerical integration of Eqs. (11), with the exception of very
large $\alpha$ and correspondingly small $\theta$ (see Fig. ^). This is not surprising, since the $\alpha^{-2}$-decay
is an effect of patterns that are classified differently. As long as the perceptrons give the
same output on all patterns, $w_1$ and $w_2$ grow linearly, but the difference $\mathbf{w}_1 - \mathbf{w}_2$ does not
change, leading to $\theta \propto \alpha^{-1}$. This is observed in simulations for small angles, where no
patterns happened to be classified differently on the considered timescale. Mathematically,
this is related to a breakdown of the self-averaging properties of Eqs. (11) at the point $\theta = 0$.



II. MUTUAL LEARNING, COMPETITION 

In the previous section, all of the neural networks behave in the same way. Each percep- 
tron tries to learn the output of its neighbour, and only the initial weight vectors are chosen 
randomly and differ from each other. Now we investigate a scenario where two networks 
behave differently. Network 1 is trying to simulate network 2 while 2 is trained on the
opposite of the opinion of 1. This scenario describes a competition between two adaptive
algorithms. If 2 is completely successful, the overlap is $\cos(\theta) = -1$, and perceptron 1 always
fails in its prediction, and vice versa. A motivation from game theory can be drawn from
the game of penny matching, where both players make a binary decision simultaneously.
One player wins if the decisions are the same, the other if they are different. 



A. Rule P 

If both perceptrons use rule P for their respective learning aim, the update rules are

$$\mathbf{w}_1^+ = \mathbf{w}_1 + (\eta_1/N)\,\mathbf{x}\,\sigma_2\,\Theta(-\sigma_1\sigma_2); \qquad
\mathbf{w}_2^+ = \mathbf{w}_2 - (\eta_2/N)\,\mathbf{x}\,\sigma_1\,\Theta(\sigma_1\sigma_2). \tag{13}$$

The corresponding differential equations for the order parameters are

$$\frac{dw_1}{d\alpha} = -\frac{\eta_1}{\sqrt{2\pi}}\,(1 - \cos(\theta)) + \frac{\eta_1^2}{2 w_1}\,\frac{\theta}{\pi};$$

$$\frac{dw_2}{d\alpha} = -\frac{\eta_2}{\sqrt{2\pi}}\,(1 + \cos(\theta)) + \frac{\eta_2^2}{2 w_2}\left(1 - \frac{\theta}{\pi}\right);$$

$$\frac{dR}{d\alpha} = \frac{\eta_1 w_2}{\sqrt{2\pi}}\,(1 - \cos(\theta)) - \frac{\eta_2 w_1}{\sqrt{2\pi}}\,(1 + \cos(\theta)). \tag{14}$$

The common fixed point for these equations is $w_i = \sqrt{2\pi}\,\eta_i/4$, $\cos(\theta) = 0$. This is hardly
surprising, since neither perceptron has a better algorithm than the other. The learning
rate only rescales the weight vectors; the ratio $\eta_i/w_i$, which determines how fast the direction
of $\mathbf{w}_i$ in weight space can change, is independent of $\eta_i$ at the fixed point.

B. Rule H 

The picture is slightly different if both perceptrons learn from every pattern they see.
The resulting differential equations are

$$\frac{dw_1}{d\alpha} = \sqrt{\frac{2}{\pi}}\,\eta_1\cos(\theta) + \frac{\eta_1^2}{2 w_1};$$

$$\frac{dw_2}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta_2\cos(\theta) + \frac{\eta_2^2}{2 w_2};$$

$$\frac{dR}{d\alpha} = \sqrt{\frac{2}{\pi}}\,\eta_1 w_2 - \sqrt{\frac{2}{\pi}}\,\eta_2 w_1 - \frac{\eta_1\eta_2}{\pi}\,(\pi - 2\theta). \tag{15}$$

The fixed point of R is reached if $\theta = \pi/2$ and $\eta_1/w_1 = \eta_2/w_2$, i.e. the vectors are
perpendicular and the scaled learning rates $\eta_i/w_i$ are the same for both perceptrons. Under
these conditions, the equations for $w_i$ can be solved: $w_i = \eta_i\,(\alpha + (w_{i,0}/\eta_i)^2)^{1/2}$, so $w_i$ shows
the $\sqrt{\alpha}$-scaling typical for random walks. Geometrically, the Hebb rule adds corrections to
the weight vector that are on average parallel to the teacher vector. Since the teacher is
moving at the same angular velocity as the student, the movement of both vectors resembles
a random walk. Again, $\eta$ only sets the temporal and spatial scale.



C. Rule P vs. rule H 



The result of the competition becomes more interesting when both perceptrons use
different algorithms. For example, we let perceptron 1 use rule P, while 2 uses H. The derivation
of the differential equations is again straightforward:

$$\frac{dw_1}{d\alpha} = -\frac{\eta_1}{\sqrt{2\pi}}\,(1 - \cos(\theta)) + \frac{\eta_1^2}{2 w_1}\,\frac{\theta}{\pi};$$

$$\frac{dw_2}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta_2\cos(\theta) + \frac{\eta_2^2}{2 w_2};$$

$$\frac{dR}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta_2 w_1 + \frac{\eta_1 w_2}{\sqrt{2\pi}}\,(1 - \cos(\theta)) + \eta_1\eta_2\,\frac{\theta}{\pi}. \tag{16}$$

They have a common fixed point defined by

$$\frac{\theta\,\cos(\theta)^2}{(1 - \cos(\theta))^2} = \frac{\pi}{4}; \qquad
w_1 = \frac{\eta_1\,\theta}{\sqrt{2\pi}\,(1 - \cos(\theta))}; \qquad
w_2 = \frac{\sqrt{2\pi}\,\eta_2}{4\cos(\theta)}. \tag{17}$$

These equations can be solved numerically and yield $\cos(\theta) = 0.459$, $w_1 = 0.806\,\eta_1$ and
$w_2 = 1.37\,\eta_2$. Although perceptron 1 makes less use of the provided information, it wins the
competition: the perceptron using rule H has a smaller $\eta/w$-ratio and is thus less flexible.
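
The numbers quoted for this fixed point can be reproduced with a few lines of Python. The sketch assumes the fixed-point conditions in the form $\theta\cos^2\theta/(1-\cos\theta)^2 = \pi/4$, $w_1/\eta_1 = \theta/(\sqrt{2\pi}(1-\cos\theta))$ and $w_2/\eta_2 = \sqrt{2\pi}/(4\cos\theta)$; the function name is hypothetical.

```python
import math

def fixed_point_P_vs_H(lo=0.1, hi=math.pi / 2 - 1e-9, iters=80):
    """Bisect on theta*cos(theta)^2/(1-cos(theta))^2 - pi/4 in (0, pi/2).

    The expression is large and positive for small theta and negative
    close to pi/2, so the bracket contains a sign change.
    """
    def f(t):
        c = math.cos(t)
        return t * c * c / (1 - c) ** 2 - math.pi / 4
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    theta = 0.5 * (lo + hi)
    c = math.cos(theta)
    w1_over_eta1 = theta / (math.sqrt(2 * math.pi) * (1 - c))
    w2_over_eta2 = math.sqrt(2 * math.pi) / (4 * c)
    return c, w1_over_eta1, w2_over_eta2

cos_theta, w1_ratio, w2_ratio = fixed_point_P_vs_H()
```

The positive overlap `cos_theta` means that perceptron 1, which tries to imitate perceptron 2, is right on more than half of the patterns.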

D. Normalized weights

By setting the weights to 1 after each learning step, a new length scale is introduced,
leading to a more complex dependence of the solution on the learning rates. For brevity, we
only give the differential equations for the different learning rules and explain some common
features. If both networks use rule PN, the ODE is

$$\frac{dR}{d\alpha} = \frac{1}{\sqrt{2\pi}}\big(\eta_1(1 - R) - \eta_2(1 + R)\big) + \frac{R}{\sqrt{2\pi}}\big(\eta_1(1 - R) + \eta_2(1 + R)\big) - \frac{R}{2\pi}\big(\eta_1^2\,\theta + \eta_2^2\,(\pi - \theta)\big); \tag{18}$$

for rule HN we find

$$\frac{dR}{d\alpha} = \sqrt{\frac{2}{\pi}}\,(\eta_2 - \eta_1)(R^2 - 1) - \frac{R}{2}\,(\eta_1^2 + \eta_2^2) - \eta_1\eta_2\left(1 - \frac{2\theta}{\pi}\right); \tag{19}$$

and if rule PN is used by perceptron 1 and HN by 2, the equation is

$$\frac{dR}{d\alpha} = -R\left[-\sqrt{\frac{2}{\pi}}\,\eta_2 R + \frac{\eta_2^2}{2} - \frac{\eta_1}{\sqrt{2\pi}}\,(1 - R) + \frac{\eta_1^2\,\theta}{2\pi}\right] + \frac{\eta_1}{\sqrt{2\pi}}\,(1 - R) - \sqrt{\frac{2}{\pi}}\,\eta_2 + \eta_1\eta_2\,\frac{\theta}{\pi}. \tag{20}$$

The behaviour of the fixed point is similar in all cases (see Fig. |5|):

• if, say, $\eta_2$ is fixed and $\eta_1 \to 0$, R goes to a value $R \neq -1$. This is expected, since both
PN and HN only achieve finite values of R for fixed teachers.

• if both perceptrons use the same algorithm with the same learning rate, the result is
$R = 0$, as expected.

• if $\eta_i \to \infty$ for either i, $R \to 0$. Infinite learning rate means that in every time step the
perceptron discards all the information it previously had, replacing it with the current
$\pm\mathbf{x}$. Theoretically, that makes it predictable for the other network; in practice, both
agents are confused. The notable exception is the case of PN vs. HN, where a
non-vanishing R results if both $\eta_i \to \infty$ with a finite ratio

III. CONFUSED TEACHER



For any prediction algorithm there is a bit sequence for which this algorithm fails
completely, with 100% error [13]. In fact, such a sequence is easily constructed: Just take the
opposite of the predicted bit at each time step. In Ref. [13] a perceptron was used for the
prediction algorithm.

Here we do not consider bit sequences. However, it turns out that many statistical
properties of the prediction algorithm are similar when random inputs are used instead of
a window of the antipredictable bit sequence. Hence we consider the following scenario:
Perceptron 1 is trained on the negative of its own output. Perceptron 2 is trained on the
output of perceptron 1.

This is similar to the teacher/student model where the teacher weight vector performs 
a random walk || . But here the teacher is "confused" , it does not believe its own opinion 
and learns the opposite of it. 

The update rule of perceptron 1 now only depends on its own output:

$$\mathbf{w}_1^+ = \mathbf{w}_1 - (\eta/N)\,\mathbf{x}\,\sigma_1. \tag{21}$$

Geometrically speaking, the vector performs a directed random walk in which every learning
step has a negative overlap with the current vector. An equilibrium length is reached when
a typical learning step leads back onto the surface of an N-dimensional hypersphere. This
fixed point of $w_1$ is easily calculated to be

$$w_1 = \sqrt{2\pi}\,\eta/4 = 0.6267\,\eta, \tag{22}$$

and the weight vector typically moves on the surface of a hypersphere of that radius.
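
The relaxation towards this equilibrium length can be illustrated by integrating the equation of motion for the norm. The sketch below assumes the form $dw_1/d\alpha = -\sqrt{2/\pi}\,\eta + \eta^2/(2 w_1)$, which is what the update rule (21) gives for Gaussian patterns in the thermodynamic limit; the Euler step size, the integration time and the function name are arbitrary choices for this example.

```python
import math

def confused_norm(eta=1.0, w0=1.0, d_alpha=1e-3, alpha_max=50.0):
    """Euler integration of dw/dalpha = -sqrt(2/pi)*eta + eta^2/(2*w), starting from w0."""
    w = w0
    for _ in range(int(alpha_max / d_alpha)):
        w += d_alpha * (-math.sqrt(2 / math.pi) * eta + eta ** 2 / (2 * w))
    return w

w_equilibrium = confused_norm(eta=1.0, w0=1.0)
```

Starting from a different initial norm or learning rate, e.g. `confused_norm(2.0, 0.1)`, the norm relaxes to the same multiple of $\eta$.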

A. Rule H 

What happens if a second perceptron tries to follow the output of the confused teacher? 
Again, the results depend entirely on the used algorithm. The simplest case, the Hebb rule, 
also has a geometrical interpretation that is revealed by a look at the update rule: 

$$\mathbf{w}_1^+ = \mathbf{w}_1 - (\eta/N)\,\mathbf{x}\,\sigma_1; \qquad
\mathbf{w}_2^+ = \mathbf{w}_2 + (\eta/N)\,\mathbf{x}\,\sigma_1. \tag{23}$$



As in Section I A, the sum of both vectors is constant, so there is a class of solutions to the
ODEs

$$\frac{dw_1}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta + \frac{\eta^2}{2 w_1};$$

$$\frac{dw_2}{d\alpha} = \sqrt{\frac{2}{\pi}}\,\eta\cos(\theta) + \frac{\eta^2}{2 w_2};$$

$$\frac{dR}{d\alpha} = \sqrt{\frac{2}{\pi}}\,\eta\,(w_1 - w_2\cos(\theta)) - \eta^2 \tag{24}$$

defined by $w_{1,f} = \sqrt{2\pi}\,\eta/4$ and $w_{2,f} = -\sqrt{2\pi}\,\eta/(4\cos(\theta))$. The solution is given by the initial
condition, i.e. the initial sum $|\mathbf{w}_1 + \mathbf{w}_2|$. The fixed point angle can be calculated by applying
the cosine theorem to a triangle with side lengths $w_{1,f}$, $w_{2,f}$ and $|\mathbf{w}_1 + \mathbf{w}_2|$; starting from
perpendicular vectors of norm $w_0$, one finds

$$\cos(\theta) = -\left(1 + \frac{16\,w_0^2}{\pi\,\eta^2}\right)^{-1/2}. \tag{25}$$

Geometrically, for large learning rate $\eta$ both norms become much larger than $w_0$; the only
way to achieve this while keeping the sum constant is a large angle. For small $\eta$, $w_1$ becomes
very small compared to the sum, and thus to $w_2$. So the direction of $\mathbf{w}_2$ stays nearly
unchanged while $\mathbf{w}_1$ performs its random walk, leading to nearly perpendicular vectors on
average.
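
The cosine-theorem argument can be checked numerically. The sketch below assumes the fixed-point norms $w_{1,f} = \sqrt{2\pi}\,\eta/4$ and $w_{2,f} = -\sqrt{2\pi}\,\eta/(4\cos\theta)$ and the closed form $\cos\theta = -(1 + 16 w_0^2/(\pi\eta^2))^{-1/2}$ for the angle; it then verifies that the squared norm of the sum equals its initial value $2 w_0^2$ for perpendicular start vectors. Function names are hypothetical.

```python
import math

def confused_H_fixed_point(eta, w0):
    """Fixed-point angle and norms for a Hebbian student following the confused teacher."""
    cos_theta = -1.0 / math.sqrt(1.0 + 16.0 * w0 ** 2 / (math.pi * eta ** 2))
    w1 = math.sqrt(2 * math.pi) * eta / 4
    w2 = -math.sqrt(2 * math.pi) * eta / (4 * cos_theta)
    return cos_theta, w1, w2

def sum_norm_squared(w1, w2, cos_theta):
    """|w1 + w2|^2 from the two norms and the enclosed angle (cosine theorem)."""
    return w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * cos_theta

# Example parameter pairs (eta, w0); the values are arbitrary.
checks = []
for eta, w0 in [(0.1, 1.0), (1.0, 1.0), (10.0, 1.0), (3.0, 0.5)]:
    c, w1, w2 = confused_H_fixed_point(eta, w0)
    checks.append((sum_norm_squared(w1, w2, c), 2 * w0 ** 2, c))
```

For large $\eta/w_0$ the cosine approaches -1, for small $\eta/w_0$ it approaches 0, in line with the geometric picture above.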



B. Rule P 



If perceptron 2 uses rule P, the sum of the vectors is not conserved, and a simple
geometrical interpretation is not possible. However, the equations of motion can still be
solved:

$$\frac{dw_1}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta + \frac{\eta^2}{2 w_1};$$

$$\frac{dw_2}{d\alpha} = -\frac{\eta}{\sqrt{2\pi}}\,(1 - \cos(\theta)) + \frac{\eta^2\,\theta}{2 w_2 \pi};$$

$$\frac{dR}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta\,w_2\cos(\theta) + \frac{w_1\,\eta}{\sqrt{2\pi}}\,(1 - \cos(\theta)) - \eta^2\,\frac{\theta}{\pi}. \tag{26}$$

The fixed point of $\cos(\theta)$ is given by the solution of $4\theta/\pi = (1 - \cos(\theta))^2$, independent of
$\eta$. The numerical solution is $\theta = 0.777\pi$, $\cos(\theta) = -0.761$, $w_2 = 0.552\,\eta$ (in accordance with
Ref. [13], where a special case of this problem was solved). Remarkably, the generalization
error is larger than 50% - even the "smarter" perceptron learning rule predicts the behaviour
of the confused teacher with less success than random guessing would.
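
The fixed point can be obtained with a simple bisection. The sketch below assumes the condition in the form $(1 - \cos\theta)^2 = 4\theta/\pi$ together with $w_2/\eta = \theta/(\sqrt{2\pi}(1 - \cos\theta))$; the bracket [1.5, 3.0] is chosen to exclude the trivial solutions $\theta = 0$ and $\theta = \pi$. The function name is hypothetical.

```python
import math

def confused_P_fixed_point(lo=1.5, hi=3.0, iters=80):
    """Solve (1 - cos(theta))^2 - 4*theta/pi = 0 for the non-trivial root."""
    def g(t):
        return (1 - math.cos(t)) ** 2 - 4 * t / math.pi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    theta = 0.5 * (lo + hi)
    cos_theta = math.cos(theta)
    w2_over_eta = theta / (math.sqrt(2 * math.pi) * (1 - cos_theta))
    return theta, cos_theta, w2_over_eta

theta_fp, cos_fp, w2_fp = confused_P_fixed_point()
generalization_error = theta_fp / math.pi  # probability of a wrong prediction
```

The generalization error is the probability $\theta/\pi$ that student and teacher disagree on a random pattern; it comes out above one half, as stated in the text.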



C. Optimal learning rule 

This raises an interesting question: is there any "reasonable" algorithm for perceptrons 
that allows them to track the confused teacher? If there are algorithms that achieve a 
positive overlap, one of them has to be the rule that optimizes student-teacher overlap in 
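each time step - the optimal weight function derived by Kinouchi and Caticha [14]:

$$\eta f_{\mathrm{opt}} = \frac{w_2\tan(\theta)}{\sqrt{2\pi}}\;
\frac{\exp\left(-\dfrac{(\mathbf{x}\cdot\mathbf{w}_2)^2}{2\tan(\theta)^2\,w_2^2}\right)}{\Phi\big(\sigma_1\,\mathbf{x}\cdot\mathbf{w}_2/(w_2\tan(\theta))\big)}, \tag{27}$$

where $\Phi(x) = \int_{-\infty}^{x} \exp(-z^2/2)/\sqrt{2\pi}\,dz$. If $w_1$ is set to its fixed point for simplicity's sake,
calculation yields the following ODEs for $\cos(\theta)$ and $w_2$: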

^cos(fl) _ 1 sin(g) 2 I 2 

da " 4tt cos(fl) V2^ Wl cos(fl) ' 

^ = ^tan(#) 2 /, where (29) 

r J— ( 1 + cos(g) 2 2 \ i 

/ / -^=exp I — 2sin( ^ 2 x j ^_ xcom)(l){xcom f x - 
Calculating whether $\cos(\theta) = 0$ is in fact a fixed point of the confused teacher/optimal
student scenario is problematic, since the optimal weight function (27) diverges at $\theta = \pi/2$.
However, the numerical solution of Eqs. (28) and (29) shows clearly that even starting from
$\cos(\theta) = 1$, the system evolves towards $\theta = \pi/2$, which indeed seems to be the upper limit
for success. Simulations of the learning process again agree well with our theory (see Fig.

D. Rule HN



There is a way of achieving a positive overlap with the confused teacher with simple
learning rules: if the teacher perceptron is "slowed down" by keeping its weights normalized
and setting $\eta$ to some small value, a student using PN or HN can track the teacher nearly
perfectly for very small learning rates. For simplicity's sake, let us consider HN with
identical learning rates. The differential equation for R is

$$\frac{dR}{d\alpha} = (R + 1)\left(\sqrt{\frac{2}{\pi}}\,\eta\,(1 - R) - \eta^2\right); \tag{31}$$

the fixed points are $R = -1$ or $R = -\sqrt{\pi/2}\,\eta + 1$. This result is again confirmed by
simulations, as seen in Fig. [5]. The fixed point goes to 1 as $\eta \to 0$.

IV. PERCEPTRONS IN THE MINORITY PROBLEM 

The concept of interacting neural networks can be applied to a problem that has received 
much attention recently: the El Farol Bar Problem 0. The problem was originally inspired 
by a popular bar that has a limited capacity: if too many people attend, it becomes crowded, 
and patrons don't enjoy the evening. In a more special formulation, each agent out of a 
population of K decides in each time step (each Saturday evening) to take one of two 
alternatives (go to the bar or stay at home). Those agents who are in the minority win, the 
others lose. Decisions are made independently; the only information available to agents is
what the decision of the minority was in the last N time steps.

Many papers (see e.g. [11]) investigated a specific realisation of the model called the
Minority Game. In this model each agent has a small number of randomly chosen decision
tables (Boolean functions) that prescribe an action based on the previous history, and which
of the tables is used is decided according to how successful each one was in the course of the
game. It turned out that the success of the game depends on the ratio between the number
of players and the size of the history window, and general conclusions on the behaviour of
crowded markets were drawn [15,16].

We will discuss a different approach that yields different behaviour: Each agent i is
represented by a perceptron $\mathbf{w}_i$ that uses the time series $\mathbf{S}_t = (S_t, S_{t-1}, \dots, S_{t-N+1})$ of past
minority decisions to make a prediction on the next time step. It then learns the output of
the minority according to some learning rule.

In our approach, all of the agents are flexible in their decisions. Each agent uses an 
identical adaptive algorithm which is trained by the history of the game, the only information 
available to each of the agents. However, each agent uses a different randomly chosen initial 
state of its network. If all weight vectors of the networks collapsed, all agents would
make the same decision, and all would lose. If all weights remained in the random initial 
state, each agent would make a random guess which yields a reasonable performance of the 
system. Our calculation shows that training can improve the performance of the system 
compared to the random state. 



Following Ref. [fTTH , we replace the history St by a random vector x. Simulations show 
that this changes the results only quantitatively, if at all. 

This strategy fulfills the restrictions that the original problem posed: the agents do 
not communicate except through majority decisions, and individual decisions are based on 
experience (induction or learning) rather than perfect knowledge of the system (deduction). 
However, since each player uses only one strategy whose parameters can be fine-tuned to 
the current environment rather than a set of completely different strategies, no quenched 
bias in the players' behaviour is to be expected. 



A. General notes on performance 

The commonly used measure of collaboration in the minority problem is the average
standard deviation of the sum of outputs of all agents:

$$\sigma^2 = \left\langle \Big(\sum_{i=1}^{K} \mathrm{sign}(\mathbf{x}\cdot\mathbf{w}_i)\Big)^2 \right\rangle_{\mathbf{x}}. \tag{32}$$

If each agent makes random decisions, one gets $\sigma^2/K = 1$. The probability of two
perceptrons i and j giving the same output on a random pattern is $1 - \theta_{ij}/\pi$. Any ensemble of
vectors $\mathbf{w}_i$ can be thought of as centered around a center of mass $\mathbf{C} = \sum_{i=1}^{K} \mathbf{w}_i/K$ with
a norm C (for random vectors of length 1, C would be of order $1/\sqrt{K}$). The weights can
then be written as $\mathbf{w}_i = \mathbf{g}_i + \mathbf{C}$, with $\sum_{i=1}^{K} \mathbf{g}_i = 0$. For the sake of simplicity, we will
assume a symmetrical configuration with $\mathbf{g}_i^2 = 1$ and $\mathbf{g}_i\cdot\mathbf{g}_j = -1/(K - 1)$ for $i \neq j$. (An
ensemble of randomly chosen vectors of norm 1 would give $\mathbf{g}_i^2 = 1 - 1/K \pm O(1/\sqrt{N})$ and
$\mathbf{g}_i\cdot\mathbf{g}_j = -1/K \pm O(1/\sqrt{N})$.)

The average overlap between different weights is now $R = C^2 - 1/(K - 1)$, their average
norm $w_i = \sqrt{C^2 + 1}$. With this, Eq. (32) can be evaluated:

$$\frac{\sigma^2}{K} = \frac{1}{K}\left\langle K + \sum_{i}\sum_{j \neq i} \mathrm{sign}(\mathbf{x}\cdot\mathbf{w}_i)\,\mathrm{sign}(\mathbf{x}\cdot\mathbf{w}_j) \right\rangle
= 1 + (K - 1)\left(1 - \frac{2}{\pi}\arccos\left(\frac{C^2 - 1/(K - 1)}{C^2 + 1}\right)\right). \tag{33}$$

If C is set to 0 and K is large, a linear expansion of the arccos term in Eq. (33) gives
$\sigma^2_{\mathrm{opt}}/K \approx 1 - 2/\pi = 0.363$. The small anticorrelations (of order 1/K) between the vectors
suffice to change the prefactor in the standard deviation.

If C is much larger than g, there is a strong correlation between the perceptrons. Most
perceptrons will agree with the classification by the center of mass $\mathrm{sign}(\mathbf{x}\cdot\mathbf{C})$. As $C \to \infty$,
$\sigma^2/K$ saturates at K.
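
The three regimes discussed above can be read off from a direct evaluation of Eq. (33). The sketch below assumes the symmetric configuration described in the text, with mutual overlap $\cos\theta = (C^2 - 1/(K-1))/(C^2 + 1)$; the function name `sigma2_over_K` is hypothetical.

```python
import math

def sigma2_over_K(C, K):
    """Variance per agent of the summed outputs for center-of-mass norm C and K agents."""
    cos_theta = (C ** 2 - 1.0 / (K - 1)) / (C ** 2 + 1.0)
    return 1.0 + (K - 1) * (1.0 - 2.0 * math.acos(cos_theta) / math.pi)

anticorrelated = sigma2_over_K(0.0, 1001)                 # C = 0, large K
uncorrelated = sigma2_over_K(math.sqrt(1.0 / 20.0), 21)   # overlap R = 0
clustered = sigma2_over_K(1e3, 21)                        # C much larger than g
```

For C = 0 the result is close to $1 - 2/\pi$; for $C^2 = 1/(K-1)$ the weights are on average orthogonal and the random-guessing value 1 is recovered; for very large C the value approaches K.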

B. Hebbian Learning 

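Now each perceptron is trying to learn the decision of the minority according to rule H.
S denotes the majority decision:

$$\mathbf{w}_i^+ = \mathbf{w}_i - \frac{\eta}{N}\,\mathbf{x}\;\mathrm{sign}\Big(\sum_{j=1}^{K} \mathrm{sign}(\mathbf{x}\cdot\mathbf{w}_j)\Big) = \mathbf{w}_i - \frac{\eta}{N}\,\mathbf{x}\,S. \tag{34}$$

As the same correction is added to each weight vector, their mutual distances remain
unchanged. Only the center of mass is shifted. We now treat C as an order parameter:

$$\mathbf{C}^+ = \sum_{i=1}^{K} \mathbf{w}_i^+/K; \tag{35}$$

$$C^{2+} = C^2 - \frac{2\eta}{N}\,\mathbf{x}\cdot\mathbf{C}\,S + \frac{\eta^2}{N}. \tag{36}$$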
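
A single learning step of Eq. (34) is easy to write down explicitly. The sketch below uses assumed example parameters (K = 5 agents, N = 10, $\eta = 0.5$, one random Gaussian pattern, fixed seed) and hypothetical function names; K is chosen odd so that the majority is always defined.

```python
import math
import random

def dot(a, b):
    """Scalar product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def minority_hebb_step(weights, x, eta):
    """Every agent learns the minority decision -S with the Hebbian rule."""
    n = len(x)
    outputs = [1 if dot(x, w) >= 0 else -1 for w in weights]
    S = 1 if sum(outputs) > 0 else -1  # majority decision (no ties for odd K)
    shifted = [[wi - (eta / n) * S * xi for wi, xi in zip(w, x)] for w in weights]
    return shifted, S

def distance(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(1)
K, N, eta = 5, 10, 0.5
weights = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
new_weights, S = minority_hebb_step(weights, x, eta)

# Center of mass before and after the step.
C_old = [sum(w[k] for w in weights) / K for k in range(N)]
C_new = [sum(w[k] for w in new_weights) / K for k in range(N)]
```

All pairwise distances are the same before and after the step, and the center of mass moves by $-(\eta/N)\,\mathbf{x}\,S$.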
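To average over $\mathbf{x}\cdot\mathbf{C}\,S$ in the thermodynamic limit, we introduce a field $h = \mathbf{x}\cdot\mathbf{C}$ and average
over $\mathbf{x}$ for fixed h:

$$\mathbf{x}\cdot\mathbf{C}\,S = |h|\;\mathrm{sign}\Big(\sum_{j=1}^{K} \mathrm{sign}(h)\,\mathrm{sign}(\mathbf{x}\cdot\mathbf{g}_j + h)\Big). \tag{37}$$

The quantity $\mathrm{sign}(h)\,\mathrm{sign}(\mathbf{x}\cdot\mathbf{g}_j + h)$ is a random variable with mean $\mathrm{erf}(|h|/\sqrt{2})$ and variance
$1 - \mathrm{erf}(|h|/\sqrt{2})^2$. In a linear approximation for small $|h|$, we replace this by mean $\sqrt{2/\pi}\,|h|$
and variance 1.

For sufficiently large K, one can use the Central Limit Theorem to show that
$\sum_{j=1}^{K} \mathrm{sign}(h)\,\mathrm{sign}(\mathbf{x}\cdot\mathbf{g}_j + h)$ becomes a Gaussian random variable with mean $\sqrt{2/\pi}\,K|h|$.
Since the terms of the sum in (37) are anticorrelated rather than independent, the variance
turns out to be $(1 - 2/\pi)K$ rather than K, analogously to Eq. (33). This yields

$$\left\langle \mathrm{sign}\Big(\sum_{j=1}^{K} \mathrm{sign}(h)\,\mathrm{sign}(\mathbf{x}\cdot\mathbf{g}_j + h)\Big) \right\rangle = \mathrm{erf}\left(\sqrt{K/(\pi - 2)}\;|h|\right). \tag{38}$$

Since h is a Gaussian variable with mean 0 and variance $C^2$, the average over hS can now
be evaluated. We find the following differential equation for the norm of the center of mass:

$$\frac{dC}{d\alpha} = -\eta\,\sqrt{\frac{2}{\pi}}\left(1 + \frac{\pi - 2}{2 K C^2}\right)^{-1/2} + \frac{\eta^2}{2C}. \tag{39}$$

The fixed point of C, which can be plugged into Eq. (33) to get $\sigma^2/K(\eta, K)$, is

$$C = \frac{\sqrt{\pi}\,\eta}{4}\left[1 + \left(1 + \frac{16(\pi - 2)}{\pi K \eta^2}\right)^{1/2}\right]^{1/2} \tag{40}$$

(see Figs. || and ^|).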

If C is large, the majority of perceptrons will usually make the same decision as $\mathbf{C}$, which
then behaves like the single confused perceptron: $C \to \sqrt{2\pi}\,\eta/4$ if $K\eta^2 \to \infty$ - compare to
Eq. (22).

For small C, the majority may not coincide with $\mathrm{sign}(\mathbf{x}\cdot\mathbf{C})$. In that case, the learning
step has a positive overlap with $\mathbf{C}$, leading to $C \propto \sqrt{\eta}$ as $\eta \to 0$.
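
Both limits can be checked numerically. The sketch below assumes the equation of motion in the form $dC/d\alpha = -\eta\sqrt{2/\pi}\,(1 + (\pi-2)/(2KC^2))^{-1/2} + \eta^2/(2C)$ and the closed-form positive root obtained from it by setting the right-hand side to zero; both forms and the function names are assumptions of this example.

```python
import math

def dC_dalpha(C, eta, K):
    """Right-hand side of the assumed equation of motion for the center-of-mass norm."""
    damping = 1.0 / math.sqrt(1.0 + (math.pi - 2.0) / (2.0 * K * C ** 2))
    return -eta * math.sqrt(2.0 / math.pi) * damping + eta ** 2 / (2.0 * C)

def C_fixed(eta, K):
    """Positive root of dC/dalpha = 0 (a quadratic equation in C^2)."""
    inner = math.sqrt(1.0 + 16.0 * (math.pi - 2.0) / (math.pi * K * eta ** 2))
    return (math.sqrt(math.pi) * eta / 4.0) * math.sqrt(1.0 + inner)

# Residuals of the fixed-point condition for a few example parameters.
residuals = [abs(dC_dalpha(C_fixed(eta, K), eta, K))
             for eta, K in [(0.1, 21), (1.0, 21), (5.0, 101)]]

# Limit K*eta^2 -> infinity: C/eta approaches sqrt(2*pi)/4.
large_limit = C_fixed(1.0, 10 ** 8)

# Limit eta -> 0: C/sqrt(eta) becomes independent of eta.
ratio_a = C_fixed(1e-6, 21) / math.sqrt(1e-6)
ratio_b = C_fixed(1e-8, 21) / math.sqrt(1e-8)
```

The value of `C_fixed(eta, K)` can then be inserted into the expression for $\sigma^2/K$ to obtain the performance as a function of $\eta$ and K.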

The derivation given is only correct if $N \to \infty$ and $K$ is large. However, simulations
show very good agreement even for $K = 21$ and $N = 100$ (see Fig. P). For a smaller number
of dimensions $N$, there is even a tendency towards smaller $\sigma^{2}/K$. This can be understood in
the extreme case of $N = 1$: each perceptron is characterized by one number; the outcome
is decided by whether the majority of numbers is smaller or larger than 0, regardless of
the "pattern". The learning step consists of shifting all numbers up or down by the same
amount. In the case of small $\eta$, the fixed point is characterized by $(K - 1)/2$ players firmly
on one side of the origin, $(K - 1)/2$ on the other side, and one unfortunate loser who changes
sides at every step.
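The one-dimensional case can be reproduced in a few lines. This Python sketch assumes an odd number K of players, evenly spaced starting weights with a spacing much larger than the learning rate, and a "pattern" that is a single random sign; all of these are illustrative choices rather than the setup used for the figures.

```python
import numpy as np

def simulate_1d(K=21, eta=0.01, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.linspace(0.5, 2.5, K)                  # assumed start, spacing 0.1 >> eta
    totals = []
    for _ in range(steps):
        x = rng.choice([-1.0, 1.0])               # the pattern is a single sign
        sigma = np.where(x * w >= 0, 1, -1)       # decisions sign(x w_i)
        total = int(sigma.sum())
        S = 1 if total > 0 else -1                # majority decision
        w = w - eta * x * S                       # Eq. (34) with N = 1
        totals.append(total)
    return w, np.array(totals)

w, totals = simulate_1d()
K = len(w)
print(np.sum(w > 0.05), np.sum(w < -0.05))        # players firmly on either side
print(np.mean(totals[-500:] ** 2) / K)            # sigma^2 / K in the stationary state
```

Once the middle player has reached the origin, the sum of decisions alternates between +1 and -1, so the mean squared sum is 1 and sigma^2/K equals 1/K for this setup.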

Interestingly, if the time series generated by the minority decisions is used as patterns, the
functions $\sigma^{2}(C)$ and $C(\eta)$ are quantitatively different from those found for random patterns.
However, in the final result $\sigma^{2}(\eta)$ no disagreement can be noticed (see Fig. 9).

The presented Hebb algorithm may appear too simplistic and the chosen initial conditions 
too artificial. It must therefore be emphasized that there are other learning algorithms that 
lead to the same anticorrelated state. In particular, a variation of rule PN has proven
successful in simulations (see Fig. 10): all perceptrons that are on the minority side take a
learning step, and weights are kept normalized. The regular rule P, where perceptrons on
the majority side move, however, leads to strong clustering and $\sigma^{2}/K \propto K$.

The absence of scaling behaviour if $N > K$ and the fact that smaller dimensions
(corresponding to smaller memory of the time series) even improve the results show that the
conclusions drawn from the "conventional" Minority Game do not apply to all conceivable
strategies for the Bar Problem. We think that the dependence of $\sigma^{2}/K$ on the ratio between
available strategies and players is caused by the use of quenched strategies and will not arise
in any scenario in which agents stick to one strategy which is fine-tuned by some learning
process.

The case of $N = 1$ implies that there are strategies that give $\sigma^{2}/K \propto 1/K$. We will
elaborate on this point in another publication.

V. SUMMARY



We have investigated several scenarios of mutually interacting neural networks. Using 
perceptrons with well-known on-line training algorithms in the limit of infinite system size, 
we derived exact equations of motion for the dynamics of order parameters which describe 
the properties of the system. In the first scenario a system of K perceptrons is placed 
on a ring. All perceptrons receive the same input and each perceptron is trained by the 
output of its neighbour on the ring. We have used two well-known training algorithms: 
the perceptron rule which concentrates on examples where the networks disagree, and the 
Hebbian rule where each example changes the weights. We find that with unnormalized 
weights the system relaxes to a stationary state of high symmetry: each perceptron has the 
same overlap with all others. The overlap depends on the learning rate: with increasing $\eta$
the perceptrons increase their mutual angle as much as possible.

For the perceptron learning rule with normalized weights we find phase transitions with 
increasing learning rate $\eta$. For large values of $\eta$, the symmetry is broken, but the symmetry
of the ring is still conserved. For the Hebbian rule we find a different behaviour. The lengths 
of the weights diverge, the mutual angles shrink to zero and the perceptrons eventually come 
to perfect agreement in the limit of infinitely many training examples. 

We furthermore study the behaviour of perceptrons that pursue competing learning aims 
for different learning algorithms. If two perceptrons follow mutually exclusive learning aims 
using the same algorithm, a draw results. If they use different rules, the outcome depends 
on factors like the rescaled learning rate $\eta/w$. We find that a perceptron that learns the
opposite of its own prediction cannot be tracked by a student perceptron that learns the 
positive output of the confused teacher: all rules achieve a negative overlap. 

Finally an ensemble of interacting perceptrons is used to solve a model of a closed market. 
Each agent uses a perceptron which is trained on the decision of the minority. Our analytic 
solution shows that the system relaxes to a stationary state which yields a good performance 
of the system for small learning rates $\eta$. In contrast to the minority game of Refs. [11], our
approach leads to identical profits for all agents in the long run. In addition, the performance
of the algorithm is insensitive to the size of the history window used for the decision.

This paper is a first step towards more complex models of interacting neural networks. 
We have presented analytically accessible cases which may open the road to a general
understanding of interacting adaptive systems, with possible applications in biology, computer
science, and economics.

The following averages are used in our calculations to derive deterministic differential
equations from the update rules. The angled brackets denote averages over isotropically
distributed pattern vectors. In the limit $N \to \infty$, $\mathbf{w}_1\cdot\mathbf{x}$ and $\mathbf{w}_2\cdot\mathbf{x}$ are correlated
Gaussian random variables, and the averages can be calculated by integrating over their joint
probability distribution with appropriate boundaries. In many cases, simple geometrical
calculations give the same result with less effort.

[Eqs. (41)-(45): illegible in this copy, apart from the left-hand side $\langle \mathbf{x}\cdot\mathbf{x}\,\Theta(-\sigma_1\sigma_2)\rangle$ of one of the averages.]

$$\langle \mathbf{x}\cdot\mathbf{w}_1\,\sigma_2\rangle = \sqrt{\frac{2}{\pi}}\,w_1\cos(\theta); \qquad (46)$$
\ _ 2w 2 sin(g) 2 

■/opt/ - ./v rns rm ' 14 ^ 



^7T COS(^) ' 

$$\langle f_{\mathrm{opt}}\,\mathbf{x}\cdot\mathbf{w}_2\,\sigma_1\rangle = 0; \qquad (48)$$

/ = /^^Lexp P^ff (^(-xcot^jc&^cot^)))- 1 ^; (49) 

;/opt)=^Me) 2 /; (50) 

(/op t X'W lCTl }=— — /. (51) 






REFERENCES 



[1] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation 
(Addison-Wesley, Redwood City, 1991).

[2] M. Opper and W. Kinzel, in Models of Neural Networks III, edited by E. Domany, J. 
van Hemmen, and K. Schulten (Springer Verlag, Heidelberg, 1995), Chap. Statistical 
Mechanics of Generalization, pp. 151-209. 

[3] On-line Learning in Neural Networks, edited by D. Saad (Cambridge University Press,
Cambridge, 1998). 

[4] M. Biehl and P. Riegler, Europhys. Lett. 28, 525 (1994). 

[5] E. Eisenstein, I. Kanter, D. Kessler, and W. Kinzel, Phys. Rev. Lett. 74, 6 (1995).

[6] M. Schröder and W. Kinzel, J. Phys. A 31, 9131 (1998).

[7] D. H. Wolpert and K. Turner, cs.LG/9908014 (unpublished).



[8] M. Biehl and H. Schwarze, J. Phys. A 26, 2651 (1993). 

[9] W. B. Arthur, Am. Econ. Assoc. Papers and Proc 84, 406 (1994). 

[10] D. Challet and Y.-C. Zhang, Physica A 246, 407 (1997).

[11] M. Marsili, D. Challet, and R. Zecchina, cond-mat/9908480 (unpublished),
D. Challet, M. Marsili, and R. Zecchina, Phys. Rev. Lett. 84, 1824 (2000), 
D. Challet and Y.-C. Zhang, Physica A 256, 514 (1998), 
R. Savit, R. Manuca, and R. Riolo, Phys. Rev. Lett. 82, 2203 (1999), 
D. Challet and M. Marsili, Phys. Rev. E 60, R6271 (1999). 

[12] G. Reents and R. Urbanczik, Phys. Rev. Lett. 80, 5445 (1998). 

[13] H. Zhu and W. Kinzel, Neural Computation 10, 2219 (1998). 

[14] O. Kinouchi and N. Caticha, J. Phys. A 25, 6243 (1992). 




[15] D. Challet, M. Marsili, and Y.-C. Zhang, cond-mat/9909265 (unpublished).



[16] N. F. Johnson, M. Hart, P. M. Hui, and D. Zheng, cond-mat/9910072 (unpublished).



[17] A. Cavagna, Phys. Rev. E 59, R3783 (1999). 






FIGURES 




[Plot for Fig. 1: simulation (symbols) and analytical (lines) results for 2, 3, 4 and 5 perceptrons; horizontal axis $\eta/w$; marked levels $-1$, $-1/2$, $-1/3$, $-1/4$.]



FIG. 1. Mutual learning with rule P (see sections IA and [Tel ): comparison between Eqs. (0)
and ([To| ) and the stationary state in simulations with $N = 100$ and $\alpha > 75$.

FIG. 2. Mutual learning with rule P: sketch of the geometrical interpretation. See section I A.

FIG. 3. Mutual learning with rule PN (cf. sections IB and IC): the system follows Eq. @
for $\eta < \eta_c$. Simulations used $N = 100$.

FIG. 4. Mutual learning with rule H: simulations with $N = 100$ show good agreement with
Eqs. (11), except for very small angles $\theta$.

FIG. 5. Competing learning aims with normalized weights: $\eta_2$ is set to 1 while $\eta_1$ is varied.
The analytical curves are fixed points of Eqs. (18), (19) and (20), respectively.

FIG. 6. Confused teacher: even with the optimal weight function (27), the student only achieves
an overlap of $\cos(\theta) = 0$. Starting values are $w_1 = w_2 = \sqrt{2\pi}/4$, $\cos(\theta) = 1$, $\eta = 1$. Simulations
are performed with $N = 2000$; the statistical error is smaller than the size of the symbols.

FIG. 7. Confused teacher: if the teacher is slowed down by normalizing its weight, it can be
tracked by a student using e.g. rule HN. The figure shows the fixed point of (31) and simulations
with $N = 100$.

FIG. 8. Fixed point of $C$ vs. $\eta$: simulations with $N = 100$ agree well with Eq. (40).

FIG. 9. Fixed point of $\sigma^{2}/K$ vs. $\eta$: the combination of Eqs. (33) and (40) shows that sufficiently
small learning rates lead to $\sigma^{2}/K < 1$.

FIG. 10. Using a modified PN algorithm improves the results, compared to Fig. 9. Simulations
again use $N = 100$.